Add Q4_K/Q5_K/Q6_K GPU support via Q8_0 dequantization#108

Merged
orionpapadakis merged 2 commits into
beehive-lab:mainfrom
AdamBien:main
May 29, 2026

Conversation

@AdamBien
Contributor

  • Add GPU support for K-quant models (Q4_K_M, Q5_K_M, Q6_K) via load-time dequantization to Q8_0
  • New FloatTensor implementations: Q4_KFloatTensor, Q5_KFloatTensor, Q6_KFloatTensor
  • Dequantization correctly handles TornadoVM's 16-byte ARRAY_HEADER memory layout
  • Centralize weight loading log message in AbstractModelLoader (shows actual model quantization, e.g. "Q4_K_M -> Q8_0")
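
To illustrate the kind of layout arithmetic the points above refer to, here is a minimal sketch of Q8_0 block quantization and of addressing blocks behind a fixed header. It is not the code from this PR: `Q8_0Sketch`, `blockOffset`, `quantizeBlock` and `dequantize` are hypothetical names, the sketch assumes blocks are stored contiguously after the 16-byte header, and it returns the scale as a `float` where the GGUF Q8_0 format stores it as fp16.

```java
// Sketch only: hypothetical names, not the classes added by this PR.
final class Q8_0Sketch {
    static final int BLOCK_SIZE = 32;          // weights per Q8_0 block
    static final int BLOCK_BYTES = 2 + 32;     // fp16 scale + 32 int8 quants (GGUF Q8_0)
    static final int ARRAY_HEADER_BYTES = 16;  // header size quoted in the PR description

    /** Byte offset of a block, assuming blocks follow the header contiguously. */
    static long blockOffset(int blockIndex) {
        return ARRAY_HEADER_BYTES + (long) blockIndex * BLOCK_BYTES;
    }

    /** Quantizes 32 floats starting at src[srcPos] into quants; returns the block scale. */
    static float quantizeBlock(float[] src, int srcPos, byte[] quants) {
        float maxAbs = 0f;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            maxAbs = Math.max(maxAbs, Math.abs(src[srcPos + i]));
        }
        float scale = maxAbs / 127f;                 // largest magnitude maps to +/-127
        float inv = scale == 0f ? 0f : 1f / scale;   // an all-zero block keeps scale 0
        for (int i = 0; i < BLOCK_SIZE; i++) {
            quants[i] = (byte) Math.round(src[srcPos + i] * inv);
        }
        return scale;
    }

    /** Reconstructs one weight from its block scale and int8 quant. */
    static float dequantize(float scale, byte q) {
        return scale * q;
    }
}
```

Forgetting the header in `blockOffset` would shift every read by 16 bytes, which is presumably the class of bug the third bullet guards against.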

Tested with:
./llamaTornado --gpu --verbose-init --metal --model /Users/abien/work/workspaces/llms/Devstral-Small-2-24B-Instruct-2512-Q4_K_M.gguf --prompt "who are you?" --gpu-memory 30GB

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Collaborator

@orionpapadakis orionpapadakis left a comment

LGTM

@orionpapadakis
Collaborator

For future reference: note the difference in GGUF Model Load time between "direct" execution (e.g. Llama Q8_0, which needs no conversion) and "dequantization" (e.g. Llama Q4_K_M, which is converted to Q8_0 at load time). This will be addressed by #118.

  • ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init
WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q8_0 -> Q8_0)

Starting TornadoVM initialization...
Here's one:

What do you call a fake noodle?

(wait for it...)

An impasta!

Hope that made you laugh! Do you want to hear another one?

==== Performance Metrics ====
achieved tok/s: 67.62. Tokens: 50, seconds: 0.74
GGUF Model Load: 681.63 ms
Compilation & CodeGen: 557.64 ms
Warmup: 2659.84 ms
Read-only weights Copy-in: 453.29 ms
  • ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q4_K_M.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init
WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q4_K_M -> Q8_0)

Starting TornadoVM initialization...
Here's one:

What do you call a fake noodle?

(wait for it...)

An impasta!

Hope that made you laugh! Do you want to hear another one?

==== Performance Metrics ====
achieved tok/s: 65.47. Tokens: 50, seconds: 0.76
GGUF Model Load: 23599.27 ms
Compilation & CodeGen: 548.25 ms
Warmup: 998.89 ms
Read-only weights Copy-in: 256.44 ms

  • ./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q5_K_M.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init
WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q5_K_M -> Q8_0)

Starting TornadoVM initialization...
Here's one:

What do you call a fake noodle?

(wait for it...)

An impasta!

Hope that made you laugh!

==== Performance Metrics ====
achieved tok/s: 67.68. Tokens: 42, seconds: 0.62
GGUF Model Load: 24139.17 ms
Compilation & CodeGen: 532.88 ms
Warmup: 943.10 ms
Read-only weights Copy-in: 244.08 ms
  • ./llama-tornado --gpu --ptx --model ~/LLMModels/granite-4.0-1b-Q4_K_M.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init
WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q4_K_M -> Q8_0)

Starting TornadoVM initialization...
Sure, here's a joke for you:

Why don't scientists trust atoms?

Because they make up everything!

This joke plays on the double meaning of "make up." In science, atoms are the basic building blocks of matter, and they "make up" all the substances we observe. However, the phrase "make up" is also used to mean fabricate or lie about something. So, the joke suggests that atoms are not trustworthy because they "make up" everything, implying they fabricate or lie about their existence.

==== Performance Metrics ====
achieved tok/s: 15.58. Tokens: 119, seconds: 7.64
GGUF Model Load: 28379.93 ms
Compilation & CodeGen: 527.33 ms
Warmup: 5179.27 ms
Read-only weights Copy-in: 372.44 ms
  • ./llama-tornado --gpu --ptx --model ~/LLMModels/Qwen3-1.7B-Q4_K_M.gguf --prompt "tell me a joke /no_think" --max-tokens 2048 --verbose-init
WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q4_K_M -> Q8_0)

Starting TornadoVM initialization...
<think>

</think>

Sure! Here's a light-hearted joke for you:

Why don't scientists trust atoms? Because they never trust **their** colleagues. 😄

Let me know if you want another one!

==== Performance Metrics ====
achieved tok/s: 28.01. Tokens: 59, seconds: 2.11
GGUF Model Load: 30445.23 ms
Compilation & CodeGen: 559.15 ms
Warmup: 3979.90 ms
Read-only weights Copy-in: 355.77 ms
  • ./llama-tornado --gpu --ptx --model ~/LLMModels/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf --prompt "tell me a joke" --max-tokens 2048 --verbose-init
WARNING: Using incubator modules: jdk.incubator.vector
Loading model weights in TornadoVM format (Q4_K_M -> Q8_0)

Starting TornadoVM initialization...
 Why did the tomato turn red?

Because it saw the salad dressing!

(This is a play on words, as tomatoes are red before they are added to a salad, and the phrase "saw the salad dressing" is meant to be humorous because tomatoes cannot see.)

==== Performance Metrics ====
achieved tok/s: 15.72. Tokens: 77, seconds: 4.90
GGUF Model Load: 107581.08 ms
Compilation & CodeGen: 628.44 ms
Warmup: 4573.61 ms
Read-only weights Copy-in: 773.35 ms
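
To put the load-time gap in numbers, the small sketch below divides the "GGUF Model Load" figures reported above for the three Llama-3.2-1B runs. `LoadTimeRatio` and `slowdown` are hypothetical names used only for illustration; the millisecond values are copied from the logs, and the comparison assumes the runs were made on the same machine.

```java
// Sketch only: compares the GGUF Model Load times quoted in the logs above.
final class LoadTimeRatio {
    /** How many times longer the dequantizing load took than the direct one. */
    static double slowdown(double dequantMs, double directMs) {
        return dequantMs / directMs;
    }

    public static void main(String[] args) {
        double q8DirectMs = 681.63;   // Llama-3.2-1B Q8_0   (Q8_0 -> Q8_0)
        double q4kMs = 23599.27;      // Llama-3.2-1B Q4_K_M (Q4_K_M -> Q8_0)
        double q5kMs = 24139.17;      // Llama-3.2-1B Q5_K_M (Q5_K_M -> Q8_0)
        System.out.printf("Q4_K_M load is %.1fx the Q8_0 load%n", slowdown(q4kMs, q8DirectMs));
        System.out.printf("Q5_K_M load is %.1fx the Q8_0 load%n", slowdown(q5kMs, q8DirectMs));
    }
}
```

Both K-quant loads come out roughly 35 times slower than the direct Q8_0 load, while the reported tok/s stays within a few percent across the three runs.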

@orionpapadakis orionpapadakis merged commit b94b20f into beehive-lab:main May 29, 2026
8 of 9 checks passed

3 participants